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1. Problem summary 

Consider the set of to x n binary matrices with row sums r := (ri, . . . , r,„) and 
column sums c (ci, . . . , c„), namely, 

n{r, c) := {z e {0, : ELi = EZi ^ij = c„ Vz, j} 

and denote the number of such matrices as 

N{r,c) := \n{r,c)\. 

The ideal goal is to sample from the uniform distribution over ^l{r,c), called 
the target distribution and denoted as 

where 1 is the indicator function and where f2(r, c) is assumed to be non-empty. 
Unfortunately, existing algorithms for exact uniform sampling are too slow for 
many applications. Importance sampling is a potential remedy, requiring the 
more modest goal of an approximately uniform distribution Q, called the pro- 
posal distribution^ that does permit fast sampling and for which Q{z) can be 
easily evaluated for any z. This paper describes a strategy for creating such 
a proposal distribution that often works well in practice. The techniques here 
are inspired by the approximately uniform proposal distribution introduced by 
Chen et al. (2005) (henceforth, CDHL). The main innovation here is the use of 
dynamic programming (DP) to combine algebraic constraints with combinato- 
rial approximations. 

2. Introduction 

Binary matrices (zero-one tables, directed graphs, bipartite graphs) arise in a va- 
riety of statistical contexts. They are used, for example, to represent occurrence 
matrices in ecology, discrctized point processes in neurophysiology, and connec- 
tivity graphs for social networks. Random generation of binary matrices is an 
important computational tool in these settings, especially for conditional infer- 
ence, in the spirit of Fisher's exact test for independence in two-way contingency 
tables. CDHL describe an algorithm for approximate uniform generation of bi- 
nary matrices with fixed row and column sums (margins, degree sequences). The 
uniform distribution is natural for a variety of conditional inference tests (and 
also for approximate counting), but exact sampling from the uniform distribu- 
tion over binary matrices with specified margins is computationally prohibitive. 

Although the CDHL distribution is only approximately uniform, the obser- 
vations can be reweighted to give consistent inferences via importance sampling. 
In Appendix A we discuss importance sampling as it is used in this context. See 
Liu (2001) for more details about importance sampling and see Chen (2007) for 
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further discussion of the importance samphng approach to random generation 
of matrices for statistical appUcations. 

The appeal of importance sampling is that the departure from uniformity can 
be quantified and corrected, because the probability of generating a given ma- 
trix under the proposal distribution is known exactly. Other approaches for uni- 
form sampling include Markov chain Monte Carlo (MCMC) algorithms with a 
uniform stationary distribution (e.g., Besag and Clifford, 1989; Bczakova ct al., 
2007; Ver heist, 2008) and randomized construction algorithms for which the 
probabilities of constructing a given matrix are unknown but which are asymp- 
totically uniform in the size of the matrix (e.g., Bayati et al., 2009). A drawback 
of MCMC approaches is that the mixing time can be difficult to quantify and 
will often be prohibitively long for large matrices (especially large sparse ma- 
trices). The MCMC algorithm in Bezakova et al. (2007), for example, mixes in 
polynomial time, but the polynomial is of a rather high order. Using an MCMC 
algorithm before the stationary distribution is reached can lead to converge 
failure for Monte Carlo estimates, or worse, apparent convergence to the wrong 
quantity. A generic importance sampling algorithm may exhibit similar patholo- 
gies (Bezakova et al., 2006, give an example with the CDHL proposal distribu- 
tion), but the proposal probabilities (the importance weights) tend to provide 
much more diagnostic information about convergence problems. Asymptotically 
uniform randomized construction algorithms provide very little diagnostic infor- 
mation and one can only hope that the asymptotics are valid. Recently, Blanchet 
(2006) has shown that a variant of the CDHL proposal distribution is asymptot- 
ically uniform in a certain sense, which represents the best of both worlds: the 
diagnostic and correction capabilities of importance sampling combined with 
guarantees that large problems will not create pathological behavior in the im- 
portance sampling estimates. The proof techniques in Blanchet (2006) should 
also extend to the proposal distributions described in this paper, but no attempt 
is made to do so here. 

Following CDHL, our goal in this paper is to create a proposal distribution, Q, 
over binary matrices that permits fast Monte Carlo sampling and that is as close 
as possible to the target distribution, P, of uniform subject to specified margins. 
As CDHL demonstrated and as we will see below, in situations with sparse 
or approximately constant margins it is possible to create practical proposal 
distributions that are extremely close to uniform. Consequently, there is little 
or no loss of statistical efficiency compared with exact uniform sampling. 

This paper describes some interesting modifications of the CDHL algorithm 
based on dynamic programming (DP) and on the recent asymptotic approxima- 
tions in Canfield et al. (2008) and GreenhiU et al. (2006). Much hke the CDHL 
procedure, the algorithms require 0(toX)"=i '^j) operations to generate a single 
sample, which scales well to large, sparse problems. The modified algorithms per- 
form extremely well in many cases, including cases with irregular margins. The 
improved performance over existing approaches is primarily a result of using the 
improved approximations in Canfield et al. (2008) and GreenhiU et al. (2006). 
While this improvement is, practically speaking, quite significant, it simply re- 
places the approximations used in CDHL with more recent findings. The DP 
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perspective, on the other hand, presents a more natural way to incorporate struc- 
tural constraints, like fixed margins or forced zeros, into an importance sampling 
approach. It substantially simplifies implementation, generalizes immediately to 
symmetric and integer-valued matrices and seems likely to be broadly applica- 
ble. Of the two modifications introduced here, the DP perspective seems to be 
the more theoretically interesting. 

We leave the investigation of most generalizations to future work and focus 
here on the situation where the row sums, r, and column sums, c, are fixed and 
the entries are constrained to be either zero or one. After describing this simple 
case, we generalize slightly to the case where at most one entry in each row 
and column is constrained to be zero, which includes the useful case of a zero 
diagonal. Chen (2007) also modified the CDHL procedure to allow for structural 
zeros. 

3. An approximately uniform proposal distribution for binary 
matrices 

This section describes the proposal distribution, Q, along with an efficient sam- 
pling algorithm. A key feature is that Q is defined (and sampled) recursively 
over columns. Some notational remarks: We continue with the notation and pre- 
cise problem statement from Section 1, including the use of bold font to denote 
vectors/matrices and normal font (with subscripts) to denote the elements of 
the corresponding vector /matrix, so that, for example, t -.^ {ti, . . . , tk). We also 
use the notation := to mean defined as, which can be important since several 
quantities are redefined throughout the text as we consider different underlying 
approximations . 

Exact sampling from P is impractical. Nevertheless, as CDHL note, in order 
to sample from P we only need to solve the (generic) problem of sampling from 
the distribution of the first column, which we denote as 

Piib) ^%-^t{b e 17i(r,c)} (3.1) 
I\ (r, c) 

where the support of this distribution is the set of feasible first columns, namely, 
f7i(r,c) := {b e {0,1}'" : 3z G n{r,c) with bi = z,i,Vi} 

and where 

c' := (c2, . . . , c„). 

(We abuse notation throughout and use fl and for matrices of any size, as 
determined by the lengths of the margin vectors.) To see that sampling from Pi 
is enough for the ultimate goal of sampling from P, note that the conditional 
distribution of the remaining sub-matrix given that the first column equals b is 
simply the uniform distribution over Q{r — b, c'). So we can recursively sample 
the columns, updating the new margins of the remaining submatrix based on 
the previously sampled columns. 

imsart-generic ver. 2008/08/29 file: cpsampling.tex date: June 4, 2009 



M. T. Harrison/Approximate Uniform Generation of Binary Matrices 



5 



Exact sampling from Pi is, of course, still impractical, but it suggests a useful 
strategy for constructing a proposal distribution Q. Namely, we can create a 
distribution Qi that is close to Pi and sample the first column from Qi- Then 
we can recursively sample the columns, updating the margins as we go. This 
recursive procedure implicitly defines a proposal distribution Q over the entire 
matrix, but we need only concern ourselves with Qi, the proposal distribution 
for the first column. This is often called sequential importance sampling, since 
the importance sampling distribution is defined in a sequential manner. 

Our goal then is to design a proposal distribution Qi that is close to Pi. As 
CDHL observe, approximations to N{r, c) can be used to formulate sensible 
choices for Qi. In general, if we are given an approximation 

iV(r,c) « N{r,c) 

with the property that N{r,c) > whenever N{r,c) > 0, then we can try to 
approximate Pi{b) in (3.1) by 

Piib)^^^t-^l{beni{r,c)}. 
N(r, c) 

Renormalizing this approximation to be a valid probability distribution gives a 
candidate proposal distribution defined by 

Qi{b) oc N{r - b, c')t{b e l^i(r, c)} (3.2) 

where oc means "proportional to" as a function of b. 

The remarkable fact is that (3.2) permits fast sampling via dynamic program- 
ming (DP) when used in conjimction with some of the best known asymptotic 
approximations for N{r,c). Before giving all of the details in subsequent sec- 
tions, we give an overview of the main ideas. To be precise we need to assume 
that ri > r2 > • ■ • > . This is not a problem since the rows can be permuted 
and then un-permutcd in deterministic preprocessing and postprocessing steps, 
as discussed in more detail below. 

The first important observation (Section 3.1) is that for these approximations 

m 

N{r~b,c')^'[[p1^{l-p.,)^-'' (3.3) 

1=1 

for some easily computable constants p G [0, 1]'" depending only on r and c, 
and for b G ^i{r, c). This means that 

?n 

Qi{b) o,Y[p'l^{l ~ p,)^-'''t{b e ni{r,c)} (3.4) 

1=1 

which is the conditional distribution of a sequence of m independent Bernoulli (pi) 
random variables (rv's) given that the sequence is in fli{r,c). 
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The second important observation (Section 3.2) is that the structure of 
r2i(r,c) also factors as 

m 

t{b e ni{r,c)} ^Y[t{h e A^}l{Y,Ulb^ e S,} (3.5) 

1=1 

for some easily computable sets , . . . , A„i C {0, 1} and Bi, . . . , B„i Q {0, 1, . . . , ci }. 
This factorization is a consequence of the Gale-Ryser conditions for the exis- 
tence of a binary matrix with certain specified margins (Gale, 1957; Ryser, 
1957). Combining the two pieces gives 

m 

Qi{b) K l[p1^{l~p,y-'^l{h e e B,} (3.6) 

i=l 

The third and final important observation (Section 3.3) is that (3.6) defines 
a Markov chain on the partial sums of the first column vector. In particular, 
using the change of variables 

i 

Si = '^bi i = 1,.. .,m 

£=1 

and defining sq = for convenience, we have that 

?n 

Q,{s) ^l[p',^-'-'{l - p,)^'^'^-'^-'h{s, - s,., e A,}l{s, e B,} (3.7) 
i=i 

Since Qi{s) factors into terms that only involve consecutive (si_i,Si) pairs, it 
must be a Markov chain. DP can be used to efficiently recover the transition 
probability matrices, at which point sampling the sequence of partial sums, and 
hence the first column, is trivial. In the next three sections (Sections 3.1-3.3) 
we provide details for each of these three main observations and then in Section 
3.4 we give an overview of the entire sampling algorithm. 

CDHL essentially noted both (3.4) and (3.5), but they did not use DP to 
combine the two. Instead, they tried two different approaches. The first began 
with (3.4) but replaced Qi{r,c) with the larger set {b : J^i^i — ^i}, namely, 
that the only constraint is that the total number of ones in the first column 
is correct. This simplified constraint reduces (3.4) to the case of conditional 
Bernoulli sampling for which efficient sampling algorithms exist (Chen and Liu, 
1997). (Interestingly, DP is one such algorithm, but not the one used by CDHL.) 
The simplified constraint makes the support of Qi larger than the support of Pi 
(in other words, the proposal generates invalid matrices) and can substantially 
reduce efficiency. To remedy this, CDHL make use of the factorization in (3.5), 
but they do not directly combine it with (3.4). Instead, they use a heuristic 
procedure in which the proposal distribution becomes a complicated mixture of 
distributions inspired by (3.4) and designed to exactly cover the support of P. 
We have not investigated the degree to which the heuristic procedure in CDHL 
agrees with the DP approach here. Certainly, the DP approach is much easier to 
implement and to analyze. The computational costs are more or less identical. 
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3.1. Approximations to N{r,c) 

The goal is to find an approximation to N{r, c) that permits the factorization 
in (3.3). Here is a simple, but generic result along these lines. 

Lemma 3.1. Suppose that 

m 

N{r, c) = go{c, to, n) Y[ gi{ri,c, to, n) (3.8) 
i=i 

whenever iV(r, c) > for some functions [gi : i = 0, . . . , to), where m and n are 
the respective number of elements of r and c. Then as b varies over fli{r,c), 
(3.3) holds with 

9i{ri - l,c',TO,n - 1) 

gi(ri,c , TO, ?i - 1) + g.i(ri - 1, c , to, n - 1) 

Lemma 3.1 applies to most of the approximations to N{r,c) that have ap- 
peared in the literature, but not to all of them. If Lemma 3.1 is not applicable, 
then the factorization in (3.3) is unlikely to hold exactly. Nevertheless, we can 
force the factorization in (3.3) with the following heuristic based on a Taylor's 
approximation. Define 1* to be the TO-length binary vector with a one at entry 
i and zeros elsewhere. Then 

N{r - b, c') = exp(log N{r - b, c')) 

« exp (logiV(r,c') + 21=1 log c')) (r, - 6, - r,)) 
« iV(r, c') exp (- J2Z^ 6,(logiV(r, c') - logiV(r ~l\c' 



=1 



for p defined by 

N{r - V,c') 



. -■ (3.10) 
N{r,c') + N{r - V,c') 

(3.9) and (3.10) give the same values for p whenever Lemma 3.1 applies, but 

(3.10) can be applied to any approximation, as long as it extends nicely to 
the potentially invalid margins (r,c') and (r — l'',c'). Note, however, that if 
Lemma 3.1 does not apply, then using p defined by (3.10) will lead to a proposal 
distribution that is different from (3.2) in that the original approximation, N , 
has been replaced by an additional Taylor's approximation. 

We suggest two approximations to N{r, c) that have been developed recently. 
Each of these are refinements of the suggestions in CDHL. For each positive 
integer £ and any nonnegative integer a we define 

[a]i := a{a~l) - ■■{a-£+ 1). 
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For a /c-vector t of nonncgativc integers we define 

Note that [r]i — [c]i whenever N{r, c) > 0, since each gives the number of ones 
in any binary matrix compatible with these margins. 

3.1.1. Canfield, Greenhill McKay (2008) 

The first approximation that we consider comes from Canfield et al. (2008) and 
is accurate (asymptotically) as long as the margins do not vary too wildly: 

«(-.K;:,:)K:)n(::)«.(4(' 

m 

M := fiir,c,m,n) := —— TTtI^I^* " WiM) 

[c]i(mn- [c]i) 

71 

V := t/(c,m,n) := — — TTtIK'^J " \-^VI"^)^ ■ 

[c]i(m7i- [c]i) 

(3.11) permits the factorization in Lemma 3.1 with 

\nj y 2[c]i{mn - [c]i) 'J 

(3.9) gives the Bernoulli probabilities, which simplify to 

r-iexpf/3(l-2(r, - [c']i/77i))) 

:= ^—^ (3.12a) 

7i-r,+ Ti cxpf /?(1 - 2{ri - [c']i/?77.)) j 

min — 1) / , , , N 

2[c']i(?Ti(n - 1) - [c']i) ' 

where we take 0/0 :— 0. Note that (3.11) is an improvement over the first 
approximation in CDHL which was just (3.11) without the exponential term, 
giving gi(ri, c, m, n) := (") and leading to the Bernoulli probabilities 

:= ^. (3.12b) 

71 

In most examples (3.12a) is a substantial improvement over (3.12b). (3.12a) 
tends to work well as long as the margins do not vary wildly. For certain patho- 
logical cases with wildly varying margins Bezakova et al. (2006) have shown 
that (3.12b) fails completely. (3.12a) also fails on these examples. 
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3.1.2. Greenhill, McKay & Wang (2006) 

The second approximation comes from Greenhill et al. (2006) and is accurate 
(asymptotically) for sparse matrices, except perhaps when the margins are ex- 
tremely variable: 



Ml 



iV(r, c) := — H : cxp (-ai (c) [r]2 - as (c) Ha - (c) [r]^) (3.13) 



a2(c) := - 
a3(c) 



2[c]l 2[c]l A[c]\ 
c]3 , [c]? 



3[c]f 2[c]l 

C]2 , [c]3 [c]i 



4[c]t 2[c]l 2[c]f 

where wc take 0/0 := 0. Lemma 3.1 docs not apply because the [rjj term does 
not factor appropriately. Using the additional approximation in (3.10) leads to 

rjexp(7(r, - 1)) \ 
1 + r, cxp(7(r^ - l)j 
7 := 2ai(c') + 3a2(c')(r, - 2) + 4a3(c')([r]2 - r, + l) 

(3.13) is an improved version of the approximation in O'Ncil (1969), which was 
mentioned in CDHL, although they did not use it for any of their reported re- 
sults. The O'Neil (1969) approximation replaces the argument of the exponential 
in (3.13) with — [r]2[c]2/(2[c]^). Lemma 3.1 now applies with 

1 / [fihich 

yi{i'i,(^, iiL, iiL) : = 

leading to 



g,(rj,c,m,m) := — exp , 



l + r,exp((r,-l)[c']2/[c']?) ' 

where we take 0/0 := 0. (3.14a) works extremely well as long as the resulting 
binary matrices are sparse, but tends to break down quickly as the matrices 
become more dense. Blanchet (2006) has shown that (3.14b) leads to sampling 
procedures that are exponentially efficient for large, sparse matrices. Presum- 
ably, (3.14a) behaves at least as well. Both happen to be exactly uniform for 
the pathological cases in Bezakova et al. (2006). 

3.2. The structure of I (r, c) 

In this section we give the details of the factorization in (3.5). 

For any fc-vector t we define t* := : ti > j} to be the number of elements 
of t that are greater then or equal to j, where j = 1,2,.... The sequence 
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t* := (^1,^21 • • • ) is called the conjugate of t. The next theorem refers to c'* 
which is the conjugate of the (n — l)-vector c' (c2, . . . , c„). 

Theorem 3.2. Chen et al. (2005). Assume that ri > r2 > ■ ■ ■ > r,„ and that 
N{r, c) > 0. Define A Ai x ■ ■ ■ x Am and B := Bi x ■ ■ ■ x Bm by 

■{0} tfr, = 0: 
{0,1} ifO<r,<n; 
,{1} ifr.i^n; 

|max{0,X;ki''^-ELiC^*},---,Ci| i/i < m; 
{ci} if i = m. 



B, := 



Lei b be a binary m-vector and let s denote the partial sums of b defined by 
Si := Yle=i bt- Then b G c) if and only if b E A and s £ B. 

The factorization in (3.5) is just a restatement of this theorem's conclusions. 
It is instructive to see how the theorem arises. For ri > • • • > ?>„, the Gale-Ryser 
conditions (Gale. 1957; Ryser, 1957) state that fl{r, c) is non-empty if and only 
if 

Z^Li ^ J2'e=i cl for i < TO and J2T=i '^^ = YJt=i 4 

So a potential first column vector b will be valid as long as b G A (obvious) and 
J7(r — b, c') is not empty (there must be some way to fill the remaining matrix). 
Ignoring for the moment that the entries of r — b may not be decreasing, we 
can apply the Gale-Ryser conditions to this last constraint to get 

Y^i^ii'^t - ^i) < J2\=i 4* for i < TO and J2T=ii'^i ^ = H7=i 4* 
which can be rearranged as 

&f > ~ Efci 4* for i < "1 and = (3.15) 

where in the last equality we have used the easy to verify property that X^fci 
J2lLi '^'i — "^i- looting that the partial sums must be in {0, . . . , ci}, we see that 
(3.15) is just the constraint that s € B. CDHL prove that nothing goes wrong 
by ignoring the fact that r — b may not be decreasing. 



3.3. Dynamic programming 

We have now arrived at the factorization of Qi given in (3.6). The goal in this 
section is to show how DP can be used to efficiently sample from Qi. The first 
step is to change variables from the binary vector b to its partial sums s. The 
distribution Qi in (3.6) on b is equivalent to the distribution Qi in (3.7) on s, 
which we reproduce here for easy reference; 

rn 

Ql{s) OC Y[pT~''-\l~P^)^-^''-'-'h{s^ - S^-1 G A,}l{s, G B,} 
i=\ 
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(Recall that wc define sq — for convenienee. Also note that using Qi here is 
an abuse of notation — we should technically give the distribution over partial 
sums a new name.) We can rewrite Qi{s) as 

rn 

Qi{s) (xY\_hi{st-i,Si) (3.16) 
1=1 

where 

/i,(s,_i,sO :=pf-^'-^(l-p,)'"^'■"'-^^l{s^-s,_l G A,}l{s, gH,} 

Recall that the constants p and the sets A and B each depend on the margins 
r and c, and that they are easy to compute and represent as shown in Sections 
3.1 and 3.2 above. So the functions hi are also easy to compute and represent. 
They can always be expressed as (ci + 1) x (ci + 1) matrices, but this is rather 
inefRcient since, for example, we know that hi{si-i, Si) = if Si — Si-i ^ {0, 1}. 
A more efhcient representation would be as (ci + 1) x 2 matrices, where the rows 
index the possible values of Si_i and the columns index the two possibilities of 
= Si_i and s.^ Sj_i + 1. 

Let S :~ {Si,...,S„i) denote the random sequence of partial sums with 
distribution given by (3.16) and define Sq for convenience. If we were given 
the standard Markov chain representation 

m 

Qiis)^l[TT,{s,\s,-i) (3.17) 

i=l 

where 7ri(s,j|si„i) :— Pr(S',; = .Si|S'i_i = Si_i), then generating a random obser- 
vation of S would be trivial. It is well known that DP can be used to convert 
from product representations like (3.16) into conditional probability represen- 
tations like (3.17). The next theorem summarizes DP in this context. 

Theorem 3.3. (See e.g. Frey (1998).) Let (5o, ^i, . . . , Sm) he a sequence of rv's 
where each Si takes values in the finite set Di and where Dq :~ {0}. Suppose 
there exists a sequence of functions hi : Di-i x Di t-^ [0, cxd) for i — 1, . . . ,7ti 
such that the distribution of (5*1, . . . , Sm) can be expressed as 

Pr{Si = si, . . . , 5"™ = Sm) oc n™ 1 hi{si-i, Si) 

Recursively (in reverse) define 

Prni'^m—l 7 ^m) • hmi^^rn—l-i'^rn) 

/3i(si_i, Sj) h^{si^i,Si)Y,s,+ieDi+i /3i+i(s,;, s,+i) for i ^ 1, . . . ,m - 1, 
where each Pi is defined over Di-i x Di. Then 

r, ( O \ O \ Pi{Si-l, Si) 

Pr(6i = St\Si-i = Si-i) = — — 

for each i = 1, . . . ,m. 
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Applying this theorem to the special structure of our problem, we first note 
that I3i{si-i, Si) = whenever hi{si^i, Si) ~ 0, so, as mentioned earlier, we can 
also represent each /3i as a (ci + 1) x 2 matrix. Similarly, the sums over Di and 
-Di+i in the DP algorithm include at most 2 nonzero elements. This means that 
the entire DP computation (in our case) takes at most O(mci) operations.^ 

The result of the DP computation is the standard Markov chain represen- 
tation for Qi{s) in (3.17). Initializing Sq = 0, we can recursively generate a 
random observation S and its probability Qi{S) in at most 0(m) operations, 
including at most m calls to a random number generator. Then we can convert 
from the partial sum representation back to the binary vector representation 
with m subtractions. This gives a single sample from Qi{b). The entire process 
of DP and sampling takes O(mci) operations. 

3.4- Overview of the algorithm 

In this section we bring everything together and present an overview of the 
algorithm. Here is the procedure to sample the first column. 

• Reorder the rows so that the row sums are decreasing: ri > • • • > r^. 

• Compute the Bernoulli probabilities p. Section 3.1 gives some suggestions. 

• Use Theorem 3.2 to compute the sets A and B. 

• Use Theorem 3.3 to compute the Markov chain representation of the par- 
tial sums. 

• Generate a random observation of the partial sums from this Markov chain 
representation and also compute its probability. 

• Deterministically convert the partial sums into the equivalent binary vec- 
tor. This is the first column for the reordered rows. 

• Return the rows to their original order. 

Neglecting the computation of p, the whole process takes 0{mci) operations.^ 
The heuristics for choosing p described in Section 3.1 require negligible amounts 
of additional computation."^ 

Once the first column has been sampled, the row and column sums are up- 
dated and the process is repeated, recursively, to generate successive columns. 

^It also important to note that the ft's in the recursion can be scaled by an arbitrary con- 
stant (even during the recursion) , which is often necessary in practice for controlling underflow 
and overflow. A good generic choice is to divide each /3i by its sum over both arguments. This 
does not affect the order of the number of operations. 

^Technically, there are an additional 0{m logm) operations required to sort the rows. How- 
ever, this need only be done once. The ordering of the rows can be efficiently updated without 
generic sorting during the recursive computation to sample the entire matrix. Furthermore, 
usually multiple random samples of the matrix are required, which also does not require ad- 
ditional sorting, so, in an efficient implementation, the sorting is best viewed as a negligible 
preprocessing step. 

^Before sampling each column, one can also deterministically assign the remaining entries 
of any rows and columns whose remaining entries must be all zeros or all ones (as evidenced 
by trivial row and column sums), and then remove these rows and columns from further 
consideration (cf., CDHL). This can be viewed as a modification of the heuristics for choosing 
p; we do not use it here. 
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The probabihtics of each column are multiplied together to generate the prob- 
ability of the entire matrix. Note that the successive values of p, A and B will 
depend adaptively on the previously sampled columns. Generating an entire 
m X n binary matrix takes 0{m[c]i) operations, where [c]i := X]j=i '^j 
total number of ones. 

The degree to which Q approximates P can be affected by a variety of deter- 
ministic preprocessing steps, such as, reordering the columns, swapping the roles 
of ones and zeros, or swapping the roles of rows and columns. Following CDHL, 
in the examples below we order the columns in order of decreasing column sums. 
(This reordering becomes necessary when we generalize to structural zeros.) We 
have not systematically explored preprocessing strategies. For any given prob- 
lem, of course, one could explore various types of preprocessing and also various 
heuristics for choosing p in order to select the best proposal distribution. 

The identical algorithm can also be used to evaluate Q{z) for any given z. 
For each column we simply omit the step of generating a random observation 
of the partial sums from the Markov chain representation and instead use the 
partial sums from the corresponding column in z. Using the Markov chain rep- 
resentation to compute probabilities remains the same. We make use of this in 
some of the examples below. 



4. Examples 

The algorithm was implemented in Matlab and computations carried out on a 
MacBook laptop with 2 GB of RAM and a 2.16 GHz dual core processor. The 
Matlab implementation is available from the author. The implementation can 
handle moderately large matrices, if they are sparse, although very large ma- 
trices of the type encountered with connectivity graphs in large social networks 
are still out of reach for a laptop. For example, generating a single observation 
for a 10^ X 10^ matrix with all row and columns sums equal to 2 takes about 2.5 
hours, whereas generating a single such 10"^ x 10'^ matrix takes about a second. 

Let Zi, . . . , Zn be iid observations (each a binary matrix) from the pro- 
posal distribution Q over ^l{r,c) and let Wi, . . . ,Wn be the (unnormalized) 
importance weights defined by Wk '■= Q{Zk) ■ We quantify the empirical 
performance of an importance sampling procedure with two summary statistics 
based on the importance weights. The first is the ratio of maximum to minimum 
importance weights and the second is an estimate of the coefficient of variation, 
namely, 

^ ,^ ^-n^^,^.,...MW, ^^^^ . , ^ 
Tamk=i,...,NWk w 
where W and are the sample mean and sample variance, respectively, of 
Wi , . . . , Wn ■ We consider Awl and cw^ « to be indicative of good perfor- 
mance, although it is easy to imagine situations where these would be mislead- 
ing. We also report W and its estimated standard deviation, S-^ :— ^ S"^ / n. 
Note that is a consistent point estimate of A^(r, c). See CDHL for more details 
about using and evaluating importance sampling procedures in this context. 
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Tabic 1 details the speed of the algorithm on 1000 x 1000 binary matrices 
with all row and column sums identical. There is no practical difference in 
speed between the different heuristics for choosing p. These run-times are merely 
meant to provide a feel for how the algorithm behaves — no attempt was made to 
control the other processes operating simultaneously on the laptop. Presumably 
a careful C or assembly language implementation would run much faster. 



Table 1 

Time needed to generate a single observation using m = n = 1000 and r\ = ri = c 



1 '"1 


2 1 4 


8 


16 


32 


64 


128 


256 


512 1 


1 time/iV 


1.2 s 1 1.6 s 


2.4 s 


4.0 s 


6.7 s 


12.4 s 


24.4 s 


39.2 s 


46.6 s 1 



Table 2 reports A on these examples using N = 1000 for each of the heuristics 
(3.12a), (3.12b), (3.14a), and (3.14b) for choosing p. (In the tables, oo means 
that the floating point precision of the machine was exceeded.) (3.12a) was the 
best in each case — the maximum and minimum importance weight are within 
a few percent of each other — and it also had the smallest cv"^ in each case. 
Table 3 reports all of our measures of importance sampling performance on these 
examples for the best case of (3.12a). 



_ Table 2 

A for N = 1000 using m = n = 1000 and ri = = Cj for all i,j 



ri 


2 


4 


8 


16 


32 


64 


128 


256 


512 


(3.12a) 


1.049 


1.075 


1.041 


1.008 


1.005 


1.004 


1.004 


1.004 


1.004 


(3.12b) 


1.390 


1.618 


1.493 


1.449 


1.585 


1.486 


1.621 


1.512 


1.574 


(3.14a) 


1.080 


1.082 


1.146 


1.375 


2.610 


4.769 


oo 


oo 


oo 


(3.14b) 


1.172 


1.215 


1.487 


2.216 


4.788 


162.9 


oo 


oo 


oo 



Table 3 

Performance of (3.12a) for N = 1000 using m = n = 1000 and ri = ri = Cj for all i,j 



ri 


A 


^ 2 
CV 




1 


1.049 


4.2 X lO"*^ 


(1.75148 ± 0.00011) X 10^13^ 


2 


1.075 


6.4 X 10-*^ 


(7.64296 ± 0.00061) x lo^^^o 


3 


1.041 


2.1 X 10"^ 


(1.01879 ±0.00005) x loi**53i 


4 


1.008 


3.9 X 10-^ 


(2.31580 ± 0.00005) x lo^^^^^g 


5 


1.005 


2.3 X 10"^ 


(6.50167 ± 0.00010) x 10^^218 


6 


1.004 


2.2 X 10-^ 


(1.22048 ± 0.00002) x 10^"°'^^'^ 


7 


1.004 


1.8 X 10"^ 


(9.38861 ± 0.00013) x 10^^'^'^02 


8 


1.004 


2.2 X 10"^ 


(6.70630 ± 0.00010) x 10243964 


9 


1.004 


2.2 X 10-^ 


(5.02208 ± 0.00007) x lO^f^Tii 



For some less balanced 50 x 100 examples, take f := (24, 22, 22, 17, 17, 17, 
17, 13, 13, 13, 12, 12, 11, 11, 11, 10, 10, 9, 9, 9, 8, 8, 8, 8, 8, 8, 7, 6, 6, 6, 6, 5, 5, 
5, 5, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2) and c := (12, 12, 10, 10, 9, 9, 9, 9, 
9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
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5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) and 
consider row and column sums of the form r := kf and c := kc for A; = 1, . . . , 4. 
Generating a single observation takes between 0.012 s and 0.016 s depending 
on k. Table 4 summarizes A as before, but this time with N = 10^. (3.12a) 
is best except for the case fc = 1, when (3.14a) is more uniform. In general, 
the performance is much worse than in the regular case and using any of these 
importance sampling algorithms for the cases fc = 3, 4 seems suspicious because 
the proposal distributions are clearly very far from uniform. Table 5 reports all 
of our measures of importance sampling performance for (3.12a). For fc = 1, 
(3.14a) gives Sb^ = 4.4 x lO""* and VF± % = (2.3071 ± .0002) x 10'^''''. 



_ Table 4 

A for N = 10"" using r = kr and c = kc. 



k 


1 


2 


3 


4 


(3.12a) 


3.248 


6.825 


299.7 


4.19 X 10** 


(3.12b) 


145.1 


918.6 


5.09 X 10'^ 


9.47 X 10^^ 


(3.14a) 


1.225 


19.86 


2.74 X 10^ 


7.66 X 10^^ 


(3.14b) 


3.823 


7.33 X 10^ 


3.97 X 10^2 


2.68 X 10^3 



Table 5 

Performance of (3.12a) for N = 10^ using r = kr and c = kc. 



ri 


A 


^ 2 
CV 


w±s^ 


1 


3.248 


1.9 X 10"^ 


(2.3069 ± 0.0003) x lO"''*"' 


2 


6.825 


3.3 X 10-2 


(2.7975 ± 0.0016) x lO''^* 


3 


299.671 


5.4 X 10-^ 


(1.9741 ± 0.0046) X 10™^ 


4 


4.19 X 10* 


2.8 X 10^ 


(8.8545 ±0.1482) x lO'^^^ 



Bezakova et al. (2006) investigates the performance of CDHL's algorithm on 
pathological margins with very large ri and ci, but with all other row and 
column sums exactly 1. They prove that (3.12b) is extremely far from uni- 
form for such margins. Empirically speaking, it seems that (3.12a) is also quite 
far from uniform in such cases. Conversely, it is straightforward to show that 
(3.14a) and (3.14b) are exactly uniform for these types of margins. Following 
Bezakova et al. (2006), we experiment with the margins r = (240, 1, . . . , 1) and 
c = (179, 1, . . . , 1) for a 240 x 301 matrix. By conditioning on the entry in the 
first row and the first column and then using symmetry, one can see that 

iV(r,c) = H ^60! + H ^61! « ^.6843103 x 10- 

Generating a single observation takes about 0.077 s. (3.14a) and (3.14b) have 
Wk = N{r,c) for all fc since they are exactly uniform, giving A = 1, cv^ = 
(where we take 0/0 = 0) and W ± Sy^^ = N{r,c) ±0. (The equivalence is not 
truly exact because of rounding errors; in this case W agreed with N{r, c) to 12 
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digits.) On the other hand, using N = 10^ samples, (3.12a) gives A ~ 4.1 X 10 , 
= 1.7 X 10^, and W ± = (2.2 ± 0.3) x lO^^^. Similarly, (3.12b) gives 
A = 5.9 X IQi^ = 4.5 X 10^, and W ± S*^ = (4.4 ± 0.9) x lO^o^. Both 
are clearly far from uniform and approximate 95% confidence intervals of the 
form W±2Sy^^ do not contain N{r, c). For (3.12b), Bezakova et al. (2006) point 
out that ignoring the importance sampling diagnostics and relying only on the 
apparent convergence of W can be quite misleading. They also show that true 
convergence of W requires N to be exponential in the size of the margins. 

Finally, consider Darwin's finch data (see CDHL) which is a 13 x 17 occurrence 
matrix with r (14, 13, 14, 10, 12, 2, 10, 1, 10, 11, 6, 2, 17) and c := (4, 4, 11, 10, 
10, 8, 9, 10, 8, 9, 3, 10, 4, 7, 9, 3, 3). A single sample takes about 0.001 s. Table 6 
summarizes the performance based on = 10^ samples. For comparison, CDHL 
report that N{r, c) = 67,149,106,137,567,626 for the finch data. This example is 
interesting because none of the algorithms perform well, even though it is quite 
small, emphasizing the fact that the heuristics for choosing p arc motivated by 
asymptotic approximations and may perform poorly on small problems. 



Table 6 

Performance for N = 10® on Darwin's finch data. 





A 






(3.12a) 


2.8 X 10^ 


0.4363 


(6.722 ± 0.004) X 10^® 


(3.12b) 


5.6 X 10* 


1.1467 


(6.715 ±0.007) X 10^® 


(3.14a) 


1.0 X 10^ 


1.3710 


(6.727 ± 0.008) X 10^® 


(3.14b) 


5.4 X 10® 


4.0081 


(6.729 ±0.013) X 10^® 



These experiments suggest that (3.12a) is a good default choice, except for 
very sparse and/or irregular margins, where (3.14a) works better. (In the neu- 
roscience application that motivated this research, the margins are sparse and 
irregular and (3.14a) has far superior performance over (3.12a).) 

4.1- External checks on uniformity 

The previous experiments are based primarily on the internal diagnostics of 
samples from the proposal distribution Q. Other than Blanchet (2006) 's analysis 
of (3.14b) (which is not even one of the algorithms we finally recommend) and 
the fact that p is based on sensible heuristics, there are no external checks on the 
uniformity of Q. Using a complicated, high dimensional proposal distribution 
without external checks can be dangerous. Indeed, consider the following worst- 
case scenario. Suppose that il{r,c) = E U E'^, where E is much smaller than 
E'^, and suppose that Q is uniform over each of E and but far from uniform 
over 0(r,c), namely, 

Qiz) = ^t{z e E} + j^l{z e E'-} for M « , « i 
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If e is extremely tiny, say e — 10^^"°, then Monte Carlo samples from Q will 
(practically speaking) always lie in E, which itself is a tiny fraction of il(r, c). 
Furthermore, all internal diagnostics will report that Q is exactly uniform, since 
it is uniform over E. But, of course, statistical inferences based on samples 
from Q will tend to be completely wrong. This section describes two types of 
experiments designed to provide external checks on the uniformity of Q. 

For the first set of experiments we generate a binary matrix Z from the uni- 
form distribution over all binary matrices with row sums r. This is easy to do 
by independently and uniformly choosing each row of Z from one of the (") 
possible configurations. Since the conditional distribution of Z given its columns 
sums C is uniform over f2(r,C), we can view Z as a single observation from 
the uniform distribution over f2(r,C). (Of course, there is no practical way to 
uniformly and independently generate another such Z with the same C.) No- 
tice that Q{Z) gives external information about the uniformity of Q for these 
margins, since it gives the value of Q at a uniformly chosen location in ri(r, C). 
Indeed, in the pathological thought experiment described above, Z would al- 
most certainly be in E'^ and Q{Z)^^ would be substantially larger than any 
of the importance weights. Alternatively, if Q is nearly uniform, then Q{Z)~^ 
should be indistinguishable from the other importance weights. In summary, we 
can compare Q to P by comparing the importance weights to Q{Z)~^. (This 
observation can also be used to give valid Monte Carlo p- values with importance 
sampling, even if the importance sampling distribution is far from the target 
distribution.) 

Each experiment of this type proceeds identically. We fix m, n, and r. Then we 
generate L iid observations, say Z^, . . . , Z^, from the uniform distribution over 
all m X n binary matrices with row sums r. The column sums of these matrices 
are C^, . . . , C^. Then, for each i = 1, . . . , L, we generate N iid observations, 
say Z^, . . . , Z^, from the proposal distribution Q over f7(r, C^). We compute 
the ratio of maximum to minimum importance weights including the original 
observation for each namely, 

,_ inaxfc^o,...,Af QjZ^y^ 
minfc=o,...,Af Qi^iy^ 

and we report the final summary Amax := max£=i^...^L A^. We do this for each of 
the four heuristics for choosing p (using the same Zq, . . . , Z^ for each). If Amax 
is close to one, then this provides evidence that Q is approximately uniform over 
a large part of each of the fl{r, C^)'s. 

We begin with regular row sums as in table 2. Take m ~ n = 1000 and 
n = for aU i. We use L = 10 and iV = 10 for the cases ri = 2, 8, 32. Table 7 
reports Amax- The results are very close to uniform. 

For another example, take the row sums for the irregular 50 x 100 case that 
was used for table 4 and take A: = 1, i.e., r = f. We use L = 100 and N ~ 1000 
and find that A,„ax = 2.264, 25.42, 1.875, and 7.962 for (3.12a), (3.12b), (3.14a), 
and (3.14b), respectively. 

Taken with the results in the previous section these preliminary experiments 
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_ Table 7 

Amax for m = n = 1000 and ri = for all i. 



ri 


2 


8 


32 


(3.12a) 


1.0002 


1.0023 


1.0051 


(3.12b) 


1.0880 


1.2130 


1.3145 


(3.14a) 


1.0000 


1.0011 


1.0845 


(3.14b) 


1.0001 


1.0086 


1.7087 



suggest that (3.12a) and (3.14a) are somewhat uniform (certainly they are not 
pathological) over a large part of the sample space for margins like the ones 
considered here. Being somewhat uniform over most of the sample space is the 
primary requirement for a useful proposal distribution if the ideal goal is to 
sample uniformly over the sample space. 

For the second type of experiment, we try to design an extreme z* S i^{r, c) 
and compare the importance weights to Q{z*)^^. Again, if Q is approximately 
uniform over all of il{r,c) then Q{z*)^^ should be indistinguishable from the 
other importance weights. For these experiments we report 

^ ma^{Q{z*)-\Wi,...,WN} 
■ mm{Q{z*)-\Wi,...,WN} 

which should be close to one if the region in 51 (r, c) where Q is approximately 
uniform includes z* . 

Consider the regular case where m ^ n = 1000 and ri ~ ri = Cj for all 
Suppose that ri evenly divides 1000 and let z* be comprised only of disjoint 
ri X n blocks of ones. In particular, take z*j — 1 for {k — l)ri + 1 < i,j < kri 
and for = 1, . . . , 1000/ri. For each of the four p heuristics and for the cases 
ri = 2, 4, 8 we compute Q{z*)~^ and compare it to the data that generated the 
corresponding parts of table 2. Table 8 summarizes the results with A* defined 
above. The departure from imiformity is striking, especially for (3.12a) which 
performed spectacularly based on internal diagnostics. (3.14a) seems to be the 
best by this measure. Clearly, these algorithms are not within a few percent of 
uniform over all oiVl{r, c) as tables 2 and 7 might seem to suggest. 



_ Table 8 

A* for the corresponding part of Table 2. 



ri 


2 


4 


8 


(3.12a) 


1.741 


27.24 


6.25 X 10* 


(3.12b) 


43.03 


1.14 X 10^ 


1.24 X 10^2 


(3.14a) 


1.367 


1.567 


197.3 


(3.14b) 


1.353 


4.115 


4.52 X lO'^ 



For another example, consider the irregular 50 x 100 margins that were used 
for table 4 and take = 1, i.e., r = r and c = c. We construct a pathological z* 
as follows. Place ci ones in the last ci rows (corresponding to the smallest row 
sums) of the first column. Place C2 ones in the last available C2 rows of the second 
column, where a row is available if placing a one in that row will not exceed the 

imsart-generic ver. 2008/08/29 file: cpsampling.tex date: June 4, 2009 



M. T. Harrison/Approximate Uniform Generation of Binary Matrices 



19 



row sum for that row. Continue in this manner until aU the columns are assigned 
or until a column cannot be assigned successfully. In general, this procedure will 
not terminate successfully, but it does for this choice of margins. The resulting 
z* is unusual because rows and columns with large sums tend to have zeros at 
the intersecting entry. Using the data from the corresponding part of Table 4 
gives A* = 15.37, 2.1 x 10^ 4.7 x 10^, and 4.8 x 10^ for (3.12a), (3.12b), (3.14a), 
and (3.14b), respectively. Again, this shows that Q is not uniformly uniform over 
ri(r, c). The departure from uniformity is substantial enough that Q can only be 
used for importance sampling, and not, for example, as a proposal distribution 
for (cfhcient) rejection sampling to sample exactly from P. 

5. Exactly enforcing a zero diagonal 

In general, if we want to force certain entries to be either zero or one, then we 
can simply set the corresponding entry in p to be zero or one, respectively, in 
the dynamic programming algorithm above. This will generate matrices with 
the correct margins and with the desired forced entries. However, this may 
lead to situations where it is impossible to choose a valid column during the 
recursive generation of columns, since the margin constraints (in A and B) do 
not account for the forced entries. If the probability of failure is high, then Q 
will not be a useful proposal distribution. An interesting question is whether 
the margin constraints can be modified to account for forced entries so that Q 
always generates valid matrices. Chen (2007) investigates this question for the 
CDHL procedure and we do the same here, albeit for a very small class of forced 
entries — essentially, the case of forcing a zero diagonal. 

Square binary matrices are often used to represent the connectivity matrix of 
a directed graph. When these graphs cannot have self connections, the diagonal 
entries should be zero. Chen (2007) gives a version of Theorem 3.2 for this case 
and modifies the CDHL procedure accordingly. Greenhill and McKay (2007) 
and Bender (1974) provide the necessary asymptotics. The results are slightly 
more general than square matrices with zero diagonal — the only constraint is 
that each row and column can have at most one forced zero entry. 

Let a be a fixed and known m x n binary matrix and define 

Q{r,c,a) := {z e n{r,c) : ZijUij = 0, Vi,j} 

to be those m x n binary matrices with margins r and c and with zeros at any 
location that a has a one. These forced zeros are called structural zeros and 
the entries of a indicate their locations. N{r,c,a) and ^li{r,c,a) are defined 
similarly as before. N{r,c,a) denotes a generic approximation to N{r,c,a). 
Specific suggestions can be found in Section 5.1 below. 

Following exactly our earlier development, the target distribution, P, is the 
uniform distribution over Q{r,c,a) and the proposal distribution, Q, will be 
defined recursively over columns via 

Qi{b) oc N{r - 6, c', a')l{b e 17i(r, c, a)} 
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where a' is the to x (n — 1) matrix created by removing the first column from 
a, namely, 

a'jj := a,;.j-fi for l<i<m, 1 < j < n — I. 

If a has at most one entry of one in each row and column (indicating at most 
one zero in each row and column), then the factorization in (3.6) continues to 
hold (at least for the suggestions for N described below), where the constants 
Pi, ■ ■ . ,Pm and the sets Ai, . . . , Am and Bi, . . . , Bm depend on r, c, and a 
and are easy to compute. The algorithm described in Section 3.4 can be used 
exactly, with the only differences being modified choices of p. A, and B in 
order to account for the forced zeros indicated by a, and also a few additional 
constraints on the order in which the rows and columns are sampled. 

In the following sections we define ^ and C, to be the row and column sums 
of a, respectively, so that 

= # of structural zeros in the ith row, 
Qj = # of structural zeros in the jth column. 

As mentioned before, the final sampling algorithm relics on the assumption that 

< 1, < 1, Vi,j (5.1) 

Although quite restrictive, this includes the important case of enforcing a zero 
diagonal. 



5.1. N{r,c,a) 

We first present the asymptotic enumeration results corresponding to Section 
3.1. Each of the suggestions (3.12a), (3.12b), and (3.14b) for computing p has 
a corresponding correction that accounts for structural zeros. These corrections 
allow for arbitrary structural zeros and do not rely on the constraint that each 
row and column has at most one structural zero. The computational cost for 
the corrections is negligible. Lemma 3.1 becomes 

Lemma 5.1. Suppose that 

m 

N{r,c,a) = go{c,a,m,n)Y\_9i{ri,c,a,m,n) (5.2) 

i=l 

whenever N{r, c, a) > for some functions (gi : i = 0, . . . , m), where m and 
n are the respective number of elements of r and c. Then as b varies over 
fli{r,c,a), we have 

m 

N{r ~ b, c', a') (X Hp-' (1 " pO'""' (5-3) 

i=l 

where 



5j(rj - 1, c', a', TO, n - 1) 



gi{ri,c', a',m,n - 1) + gi{rt - 1, c', a', to, n - 1) 



(5.4) 
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Greenhill and McKay (2007) provide the modified asymptotic enumeration 
results corresponding to (3.11) and (3.12a). 



xexp^-i(l-M)(l-i^) -V 



(5.5) 



/i := ^i{r,c,m,n) := 7TT~~~~TTT 1]1 " WiM)^ 
[c\i{mn - [c]i) ^ 

n 

1/ := iy{c,m,n) ~ — — ^ \ , x V(cj - [c]i/n) 

[c]i(mn- [c]i) 

m n 

7] := ri{r,c,a,m,n) := —— TTT H 51 ^ - [c]i/n) 

[c]i(mn- [c]i) 

which permits the factorization in Lemma 5.1 with 



gi(ri,c,a,m,n) := 



r. 



A / mn{r, - [c]i/to) 
/ V 2 c i(m7i - c i) 



[c]i(mn- [c]i) ^ 

giving the BernoulU probabilities 



(cj - [c]i/t 



r,exp(/3(l-2(r, -[c']i/m))+(5, 
K := ^ J r- (5.6a) 

n-^,^-n+ n exp(^/3(l - 2(r, - [c']i/m)) + S^j 

m(n — 1) , , , ,N 

^ ^= or /1 / / r ,M 1 - C, m, n - 1 

2[c']i(m(n - 1) - [c'Jij ' 

2[c']i(m(n- 1) - [c']i) ^ 

where we take 0/0 :— 0. This is a refinement over the suggestion in Chen (2007) 
to use gi{ri,c,a,m,n) := ("~^') and leading to 

(5.6b) 



Modifications of (3.13) and the corresponding p in (3.14a) are not currently 
available. For the approximation that led to (3.14b), however, Bender (1974) 
gives the necessary modification for structural zeros: 

""•■°-'"" n".n!n'U.=,' °-i'W B j 
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which permits the faetorization in Lemma 5.1 with 

1 / hUch '^^EjLi 



g^{n,c,a,m,m) := — 7 cxp 



mi Ml 



(5.4) gives the Bernoulh probabiUties 

n exp((n - l)[c']2l[c']l + <c;./[c]i) 
■ 1 + r,exp((r, - l)[c']^ I [c']l + <,c'^Mi] 



(5.8) 



where we take 0/0 :~ 0. Chen (2007) also noted this approximation, but did 
not use it in any examples. 

Finally, we remark that any of the approximations in Section 3.1 could also 
be used in the case with structural zeros, since it is the support that enforces the 
zeros, not the Bernoulli probabilities. Ignoring the zeros in the choice of p will 
presumably cause a drop in performance, but this may be negligible, especially 
for large matrices. In particular, (3.14a) may still be quite useful for large sparse 
irregular matrices with structural zeros. 



5.2. r2i(r, c, a) 

Here we describe the discrete mathematics corresponding to Section 3.2. The 
goal is to show that ^i{r, c, a) factors like (3.5). This factorization relies on the 
assumption that a encodes at most one forced zero in each row and column. 

Theorem 5.2. Chen (2007) Fix an m x n binary matrix a with row sums ^ 
and column sums C such that max^ < 1 and maxj Cj < 1 • Define the location 
of the structural zero in each row by 



the unique j such that Oij = 1 «/ = 1 
n+1 if = 



Assume N{r, c,a) > and assume that the columns and rows are ordered so 
that ci > ■ • • > c„ and ri > ■ ■ • > with the further constraint that ri ~ r^+i 
implies that yi < yi+i- Define A :~ Ai x ■ ■ ■ x Am and B :~ Bi x ■ ■ ■ x B„i by 

{{0} i/rj = or a,i = 1; 
{0, 1} i/ < 7'i < n - ii and an = 0; 
{1} if J-j =71- ii and an = 0; 

|max{0, ri) - di}, . . . ,ci^ if i < m; 

{ci} if i — rn; 



B, := 



where 



min [i{j - 1) + ELj+i Cfc - Eki EL2 a^fc] 
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Let b be a binary m-vector and let s denote the partial sums of b defined by 
Si := X^fci ^i- Then b G ^i{r, c, a) if and only if b £ A and s G B. In other 
words, 

m 

l{bGOi(r,c,a)} =[]l{6, gAJ1{^^^i&, GB,} (5.9) 

1=1 

Sampling proceeds exactly as before. Simply replace Theorem 3.2 with Theo- 
rem 5.2 to compute the sets A and B. The only differences are that the columns 
must be sampled in order of decreasing column sums — we did this for efficiency 
in the examples above, but now it is imperative — and the structural zeros can 
affect how ties arc decided when ordering the rows. The additional computation 
required for prccomputing dk is negligible. 

6. Conclusions 

This paper improves upon the algorithms in Chen et al. (2005) for approximate 
uniform generation of binary matrices with specified margins. Such algorithms 
are useful for Monte Carlo approximate inference and counting (via importance 
sampling). An important aspect of the improvement here is the use of dynamic 
programming (DP) to exactly enforce the margin constraints. The appeal of the 
DP perspective is that it immediately suggests a variety of generalizations, such 
as, arbitrary structural zeros, symmetric matrices and integer-valued matrices. 
The primary challenges are to 

(a) formulate the constraints into a form that DP can use; 

(b) find approximate counting formulas that are easy to compute; 

If part (a) fails, it may be possible to formulate a subset of the constraints into a 
form that DP can use. As long as the support of the resulting Q is not too much 
larger than the support of P, importance sampling can still work well. Part (b) 
remains an important challenge. Fortunately, in the case considered here the 
approximations in Greenhill et al. (2006) and Canfield et al. (2008) are easy 
to compute and work well on most examples. As improved approximations for 
N{r,c) appear in the literature they can be used immediately via (3.10) to 
create improved proposal distributions. 

For the two approximations used here, namely the ones leading to (3.12a) and 
(3.14a), the resulting proposal distributions seem to be approximately uniform 
over a large part of the sample space for a wide class of margins, with (3.12a) 
working well when the margins arc close to regular and (3.14a) working well 
when the margins arc sparse. The case of highly irregular and dense margins 
does not seem to be handled well by either algorithm. The algorithms are not, 
however, uniformly uniform — there are regions (albeit very small regions) with 
substantially smaller proposal probabilities — so these proposal distributions 
cannot be used for efficient exact uniform sampling via rejection sampling. The 
ideal goal of fast exact uniform sampling from ri(r, c) remains elusive. 



imsart-generic ver. 2008/08/29 file: cpsampling.tex date: June 4, 2009 



M. T. Harrison/Approximate Uniform Generation of Binary Matrices 



24 



Appendix A: Importance sampling 

This section contains a description of importance sampling as it relates to the 
problem at hand. We want to generate a random object Z from a target dis- 
tribution P. In the main text is an m x n binary matrix and P is the uni- 
form distribution conditioned on certain row and column sums. Usually, the 
goal is to generate an independent and identically distributed (i.i.d.) sequence 
Zx, Z2, ■ ■ ■ , Zn with common distribution P for the purposes of Monte Carlo 
inference. In particular, we can Monte Carlo approximate the expected value 
under P (denoted Ep) of any function / via the strong law of large numbers, 
namely, 

n 

-^/(Z;,)^£;p[/(Z)] (A.l) 

fe=l 

almost surely as iV — > 00. 

For many purposes, though, it is enough to be able to sample from a different 
distribution Q, called the proposal distribution, for which the probability of 
samples are easy to evaluate and whose support contains the support of P, the 
target distribution. Expectations under P are related to expectations under Q 
via 

Ep[f{Z)] = fi^)P(^) = E fi^)P{^)Q{^)-'Q{^) 

z z 

= EQ[f{Z)P{Z)Q{Z)-^] 

where Ep is the expected value when Z has the target distribution P and Eq 
is the expected value when Z has the proposal distribution Q. This implies 
that we can generate Zi, Z2, . ■ . , Zn i.i.d. from the proposal distribution Q and 
Monte Carlo approximate Ep[f{Z)] using 

1 " 

-5^/(Zfc)P(Zfc)Q-i(Zfe) ^i?Q[/(Z)P(Z)Q(Z)-i] =i?p[/(Z)] (A.2) 
fe=i 

where the convergence happens almost surely as A'^ — > 00. This is called impor- 
tance sampling and usually the goal is to choose Q so that f{Z)P{Z)Q~^{Z) 
has small variance when Z has distribution Q. If this variance is smaller than 
the variance of f{Z) when Z has distribution P, then importance sampling will 
need fewer Monte Carlo samples than the original exact sampling because the 
convergence in (A.2) will happen faster than the convergence in (A.l). 

In our situation, however, it turns out that P is known only up to a constant 
of proportionality, say 

Piz) = K-^R{z) 

where R is known, but k is not. (In our case, P will be uniform and so R is 
simply the indicator function of the support of P.) Importance sampling can 
still be used, but now k must also be related to expectations under Q via 

K^J2 ^(^) = E Ri^)Qi^r'Qi^) = Eq [R{Z)Q{Z)-'] 

z z 
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and estimated with 



1 

N 



71 



Y,R{Zk)Q-\Zk) ^ Eq[r{z)q{z)-^] 



= K 



(A.3) 



fc=i 



Notice that efficient estimation of k requires that R{Z)Q{Z)^^ have small vari- 
ance when Z has distribution Q. In other words, we need the proposal distribu- 
tion to be very similar to the target distribution. This is quite different from the 
typical goal of importance sampling. 

We can now relate expectations under P and Q via 



where, again, Z^, Z-^, . . . , Zjsr arc i.i.d. Q. This procedure will be most efficient 
when the variability of both the numerator and the denominator are small. 
These may be competing demands. We do not address the challenging problem 
of designing a proposal distribution that balances variability in both the numer- 
ator and denominator. Rather, in the spirit of CDHL we merely try to reduce 
variability in the denominator by designing a proposal distribution that is as 
close to the target as possible. 

Choosing the proposal to closely match the target (and ignoring the function 
/) may seem like a strange goal when viewed from the typical importance sam- 
pling perspective. Indeed, perfectly achieving this goal makes the importance 
sampling approach in (A. 4) reduce to the exact sampling approach in (A.l). So 
we might expect that for many functions /, importance sampling is less efficient 
than exact sampling. However, suppose that we can generate observations from 
Q much faster than observations from P. Then the total computational time for 
importance sampling might still be faster than for exact sampling, even though 
larger sample sizes are required for importance sampling. This observation was 
the motivation for CDHL and it is the motivation here: Exact uniform sampling 
of binary matrices with margin constraints is computationally prohibitive; but 
approximate uniform sampling can be quite fast. 

It is also important to keep in mind that the exact sampling Monte Carlo 
approximation in (A.l) would be fine in many situations if we could sample 
efficiently from P. For example, in hypothesis testing / is usually the indicator 
function of some rejection region. While this may be a rare event, it is usually 
enough to determine that the probability is indeed small (say, smaller than 0.05, 
for example), and it is not necessary to accurately estimate the log-probability 
(which might require importance sampling in the usual sense) . So having a fast 
procedure that closely matches the statistical efficiency of (A.l) is very desirable. 



Ep[f{Z)] = EQ[f{Z)P{Z)Q{Zy'] ^ n~'EQ[f{Z)R{Z)Q{Z)-'] 
^ EQ[f{Z)RiZ)QiZ)-^] 
EQ[RiZ)QiZ)-^] 



and consistent Monte Carlo estimation proceeds via 



N-^j:i^,fiZMZk)Q-'iZk) EQ[f{Z)R{Z)Q{Z)-^] 
N-^Yrk=iRiZu)Q-^{Zu) ^ Eq[R{Z)Q{Z)-^] 




(A.4) 
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To summarize, our goal in the text is to design approximately uniform distri- 
butions over binary matrices with margin constraints that permit fast sampling 
algorithms. Monte Carlo inference (if that is what we want to do) can proceed 
via the left-hand side of (A. 4), where the function / is chosen by the practi- 
tioner, where the function R{z) will simply be the indicator function that z is 
a binary matrix with the appropriate margins, and where Q denotes the ap- 
proximately uniform distribution that generated the i.i.d. Monte Carlo samples 
Zx, Z2, . . . , Zn. 
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